lm(), geom_smooth(), geom_ribbon(), plot.lm(), coef(), anova(), broom::tidy(), predict(), ggPredict(), expand.grid(), expand(), broom::augment(), AIC(), step(), vif()
Linear regression is one of the most commonly-used and easy-to-understand approaches to modeling. Linear regression involves a numerical outcome variable y and explanatory variable(s) x that are either numerical or categorical. We will start with simple (univariate) linear regression where y is modeled as a function of a single x variable.
The mathematical equation for simple regression is as follows:
y = B1 + B2*x + Error where B1 is the y-intercept, B2 is
the slope and Error is the error term, or the portion of y that the
regression model can’t explain.
Linear regression has 5 key assumptions:
Additionally, a good sample size rule of thumb is that the regression analysis requires at least 20 cases per independent variable in the analysis.
Here are some plots you want to look at before building a linear regression model:
GGally::ggpairs()) to check
for multicollinearity in your explanatory variablesIf you do encounter some problems with your data, there are many
solutions that can help make linear regression an appropriate analysis.
For example, if your explanatory variables aren’t normal or you have
heteroscedasticity, a nonlinear transformation (such as
log(x), x^2 or sqrt(x) ) may
solve the issue. Heteroscedasticity can also be dealt with by
calculating robust standard errors for your linear model coefficients.
If you have some nasty outliers, think about whether you might be
(scientifically) justified in removing them from the dataset. If several
of your explanatory variables are correlated, you can consider removing
some of them using stepwise regression. These will be covered in detail
in a statistics class.
Let’s use the palmerpenguins data to look again at the
relationship between bill length and bill depth. I’ll do a little
exploratory data analysis by viewing the first few data points and
calculating summary statistics, then I’ll look at the density plots and
correlation with GGally::ggpairs()
library(tidyverse)
library(palmerpenguins)
library(GGally) # ggPairs()
library(ggiraph)
library(ggiraphExtra) # ggPredict()
library(broom) # tidy() augment() #does NOT load with tidyverse
library(car) # vif()
# Exploratory data analysis:
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
summarize(penguins)
## # A tibble: 1 × 0
# visual correlation matrix for all continuous variables
penguins %>%
select(species, body_mass_g, ends_with("_mm")) %>% # select variables with names that end in "_mm"
GGally::ggpairs(aes(color = species))
# visual correlation matrix for bill depth and bill length
penguins %>%
select(bill_depth_mm, bill_length_mm) %>%
GGally::ggpairs() # calling out the library can avoid ambiguity for common-named functions, or just serve as a reminder to you
The linearity of the explanatory variable bill_depth_mm
and the independent variable bill_depth_mm doesn’t look
very promising to me. Let’s see what happens if we actually build the
linear regression. We’ll use the function lm() which stands
for “linear model” and we’ll indicate the relationship we want to test
by putting a formula of the format y~x as a parameter.
We’ll save the model results in a variable, then use the
summary() function to display the model results:
lm_1 = lm(bill_depth_mm ~ bill_length_mm, data=penguins)
summary(lm_1)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1381 -1.4263 0.0164 1.3841 4.5255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.88547 0.84388 24.749 < 2e-16 ***
## bill_length_mm -0.08502 0.01907 -4.459 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.922 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.05525, Adjusted R-squared: 0.05247
## F-statistic: 19.88 on 1 and 340 DF, p-value: 1.12e-05
Both coefficients B1 (the y-intercept) and B2 (the slope associated
with bill length) are statistically significant with
p<0.05. However, the R-squared = 0.05, which means that
bill length only explains about 5% of the variance in the bill depth
observations. That’s pretty pathetic. Since this is just a simple
univariate regression model we can quickly plot the data and the linear
model using geom_smooth() with method="lm" in
ggplot.
ggplot(data=penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
There are clear clusters in the data that are being ignored by our regression model, and the line doesn’t seem to capture any interesting trend. The negative coefficient for bill_length_mm (as well as the plotted model) indicates that as bill length increases, bill depth decreases. That doesn’t really make sense, right? What did we do wrong?
Well, grouping all of the penguin data, i.e. all three species, is pretty illogical. In fact, it violates another, less cited, assumption of linear regression: “All necessary independent variables are included in the regression that are specified by existing theory and/or research.” We probably should have included species, or separated our analysis out into three separate models, one for each species.
To check some of the assumptions of your linear model post
hoc, you can send your saved model to the plot()
function in base R:
class(lm_1) # Note that your model output is a variable of the class "lm"
## [1] "lm"
plot(lm_1) # This actually calls plot.lm() since the first parameter is class "lm"
You can learn more about this quick-and-dirty diagnostics plotting
trick by looking up the help page ?plot.lm. The function
plot.lm() is the version of the plot()
function that is called when the parameter that you pass
plot() is of the class “lm”. This output provides you four
useful plots:
There are many objective statistical tests that can be performed to check the assumptions of your data. For more resources, look at the end of this tutorial page.
Let’s do the logical thing and test the same relationship
bill_depth_mm ~ bill_length_mm but only looking at one
species:
gentoo = penguins %>% filter(species=="Gentoo")
# visual correlation matrix for bill depth and bill length
gentoo %>%
select(bill_depth_mm, bill_length_mm) %>%
GGally::ggpairs()
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).
lm_2 = lm(bill_depth_mm ~ bill_length_mm, data=gentoo)
summary(lm_2)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = gentoo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55952 -0.52572 -0.06658 0.46041 2.95390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.25101 1.05481 4.978 2.15e-06 ***
## bill_length_mm 0.20484 0.02216 9.245 1.02e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7543 on 121 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4139, Adjusted R-squared: 0.4091
## F-statistic: 85.46 on 1 and 121 DF, p-value: 1.016e-15
ggplot(data=gentoo, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
This trend looks a little better to me. The R-squared = 0.4139 (which is equal to the square of the Pearson correlation coefficient r) and the adjusted R-squared = 4.091 (here the R-squared value has been adjusted to account for model complexity, or the number of parameters in our linear model). So the model explains about 41% of the variation in bill depth. That’s pretty impressive considering we only have one variable in there. Also, the model passes the sanity check because it seems logical that bill depth will increase with increasing bill length.
We can actually use geom_smooth() to examine separate linear models for each of the three penguin species at once without formally running the regression:
ggplot(data=penguins) +
geom_point(aes(x=bill_length_mm, y=bill_depth_mm, color=species)) +
geom_smooth(aes(x=bill_length_mm, y=bill_depth_mm, color=species), method = "lm") +
geom_smooth(data=penguins, aes(x=bill_length_mm, y=bill_depth_mm), method = "lm", color="black")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Let's save this plot:
ggsave(filename="figures/bill_length_v_depth_each_species.png", height=5, width=7, units="in", dpi=300)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
So that makes it very clear that running all three species together was a serious logical error, and it produced a completely untrue result that bill depth increases as bill length decreases. In fact, when we split the three species apart, we can see that bill depth increases with increasing bill length (as you’d expect) and the relationships look pretty similar for each species. This is an example of Simpson’s paradox, which occurs when trends that appear when a dataset is separated into groups reverse when the data are aggregated. Basically, when doing statistics, you want to use your brain and your gut to build models that make sense.
Build a linear model predicting Gentoo bill depth as a function of flipper length. Plot the predictions. Which explanatory variable (bill length vs. flipper length) does a better job of predicting bill depth? What is your evidence?
In life, we typically can’t do a very good job explaining variation in some dependent variable using just a single independent variable. Multiple linear regression is when we build a function with multiple independent variables of the form:
y = B0 + B1x1 + B2x2 + … + Error
When conducting multiple linear regression, all of the same assumptions and diagnostics apply, with one important addition: the interpretation of the associated effect of any one explanatory variable must be made in conjunction with the other explanatory variables included in your model.
The statistical decisions you make should account for your end goals. These are the two types of goals when building a model:
In this lesson, and in general when you are trying to “play it safe” in your analyses, you will be modeling for explanation. This means if you are modeling some y as a a function of x1 and x2, but x1 and x2 are quite collinear, then you won’t be able to differentiate the unique impact of either x variable on y because, for example, x1 may be stealing some of the variation from x2. Then the total impact of x2 on y will be unfairly biased small, and the impact of x1 will be unfairly biased large. Who knows what chaos will ensue when you use these biased models to guide science, policy, etc.?
However, if you are modeling for prediction, and don’t actually care what the relative impact of x1 or x2 is on y, but you want your y predictions to be as accurate and precise as possible, then you don’t have to worry about multicollinearity. For example, weather predictions are like this. Meteorologists just throw everything they can into their models, even though their “explanatory variables” can be highly correlated, becuase they aren’t trying to demonstrate a relationship between rainfall and pressure, they are just trying to let you know whether it’ll be a good weekend for a beach trip.
bill_depth_mm, bill_length_mmspecies. These should typically be in the
factor class in R.There are cases when you, the data scientist, can make a conscientious choice between assigning a variable as continuous vs. categorical. A good example is “year”. If you are looking for a trend in your data over time, year should be treated as a continuous variable. If you are trying to account for some wacky conditions that can change from year to year, but that probably don’t consitute a temporal trend, you could treat year as a categorical variable.
Our adventures in simple regression taught us that we shouldn’t model bill depth as a function of bill length without accounting for species. We built three separate models, one for each species. Another option is to build one model, but include species as an explanatory variable with the function.
This is very simple to implement in the lm() function.
We’ll try a function of the form
bill_depth_mm ~ bill_length_mm + species. This model will
estimate a single slope associated with bill length and a different
intercept for each species. The resulting prediction will look like
three parallel lines, one for each species.
# Drop NA data before fitting model. This helps me avoid problems down the line with predict()
penguins_lm_3 = penguins %>%
filter(!is.na(bill_depth_mm),
!is.na(bill_length_mm),
!is.na(species))
lm_3 = lm(bill_depth_mm ~ bill_length_mm + species, data=penguins_lm_3)
There are several different ways to access your model results. You
can copy and paste coefficient estimates, p-values, etc. from the
summary() output, or you can extract various elements from
the model variable using specialized functions like coef().
If you want your results to look like an ANOVA table (remember, ANOVA
and linear regression are mathematically equivalent), you can use the
anova() function. Probably the most efficient way to get at
your model results makes use of the tidy() function in the
broom package in the tidyverse. Note that
broom is one of the packages that, although it is part of
the tidyverse, it does not load up with the
library(tidyverse), it must be loaded separately with
library(broom). This outputs the model results in a neat
data frame, which you can easily export to a .csv file to
save or create a table in your manuscript. Check the help page
?tidy.lm and you can see that you can ask for confidence
intervals to be output on your coefficients.
summary(lm_3)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + species, data = penguins_lm_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4529 -0.6864 -0.0508 0.5519 3.5915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.59218 0.68302 15.508 < 2e-16 ***
## bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
## speciesChinstrap -1.93319 0.22416 -8.624 2.55e-16 ***
## speciesGentoo -5.10602 0.19142 -26.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9533 on 338 degrees of freedom
## Multiple R-squared: 0.769, Adjusted R-squared: 0.7669
## F-statistic: 375.1 on 3 and 338 DF, p-value: < 2.2e-16
coef(lm_3) # vector of coefficients
## (Intercept) bill_length_mm speciesChinstrap speciesGentoo
## 10.5921805 0.1998943 -1.9331943 -5.1060201
coef(lm_3)[1] # subset the y-intercept
## (Intercept)
## 10.59218
anova(lm_3) # Analysis of variance tables (typically used when all predictors are categorical)
## Analysis of Variance Table
##
## Response: bill_depth_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 73.47 73.47 80.84 < 2.2e-16 ***
## species 2 949.16 474.58 522.17 < 2.2e-16 ***
## Residuals 338 307.20 0.91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Recall broom packages is not automatrically loaded with tidyverse
# yet we can access the tidy() function by calling the package explicitly
broom::tidy(lm_3) # ?tidy.lm
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.6 0.683 15.5 2.43e-41
## 2 bill_length_mm 0.200 0.0175 11.4 8.66e-26
## 3 speciesChinstrap -1.93 0.224 -8.62 2.55e-16
## 4 speciesGentoo -5.11 0.191 -26.7 3.65e-85
broom::tidy(lm_3, conf.int = TRUE, conf.level = 0.95) # Added confidence intervals to output
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.6 0.683 15.5 2.43e-41 9.25 11.9
## 2 bill_length_mm 0.200 0.0175 11.4 8.66e-26 0.165 0.234
## 3 speciesChinstrap -1.93 0.224 -8.62 2.55e-16 -2.37 -1.49
## 4 speciesGentoo -5.11 0.191 -26.7 3.65e-85 -5.48 -4.73
broom::tidy(lm_3, conf.int = TRUE, conf.level = 0.95) %>%
mutate_if(is.numeric, round, 2) # round to 2 decimals
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.6 0.68 15.5 0 9.25 11.9
## 2 bill_length_mm 0.2 0.02 11.4 0 0.17 0.23
## 3 speciesChinstrap -1.93 0.22 -8.62 0 -2.37 -1.49
## 4 speciesGentoo -5.11 0.19 -26.7 0 -5.48 -4.73
Now let’s plot the data, the linear model predictions and the
standard errors on those predictions. Although it seems super similar,
recall that the plot we made at the end of our simple linear regression
section actually had the results of three distinct linear models plotted
onto the same figure: one for each species (plus the fourth model in
black where all species were combined into one simple linear
regression). This multiple regression model lm_3 that we
created is a single model with two explanatory variables, bill_length_mm
and species. The slopes associated with each species must be equivalent
in this model formulation.
The combined libraries ggiraph and ggiraphExtra have a really
convenient plot function for examining a model called
ggPredict(). You can set the parameters so that standard
errors are printed around the model and you can also make the plot
interactive. That means if you click on a point or a line, a box will
pop up with information on that datapoint:
library(ggiraph)
library(ggiraphExtra)
ggPredict(lm_3)
ggPredict(lm_3, se=TRUE, interactive=TRUE)
While ggPredict() is a fabulous function, there is not a
lot of room to customize it. The more formal and more customizable
method for accessing your model predictions is using the
predict() function in base R.
lm_3_predictions = predict(lm_3, interval="confidence", level=0.95) # Calculates lm predictions for the original dataset ?predict.lm
head(lm_3_predictions)
## fit lwr upr
## 1 18.40805 18.25507 18.56102
## 2 18.48800 18.33346 18.64254
## 3 18.64792 18.48673 18.80911
## 4 17.92830 17.75958 18.09702
## 5 18.44803 18.29442 18.60163
## 6 18.36807 18.21542 18.52072
head(penguins_lm_3)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## # ℹ 2 more variables: sex <fct>, year <int>
penguins_lm_3_predict = cbind(penguins_lm_3, lm_3_predictions)
head(penguins_lm_3_predict)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## sex year fit lwr upr
## 1 male 2007 18.40805 18.25507 18.56102
## 2 female 2007 18.48800 18.33346 18.64254
## 3 female 2007 18.64792 18.48673 18.80911
## 4 female 2007 17.92830 17.75958 18.09702
## 5 male 2007 18.44803 18.29442 18.60163
## 6 female 2007 18.36807 18.21542 18.52072
ggplot(data=penguins_lm_3_predict, aes(x = bill_length_mm, y = bill_depth_mm, color = species) ) +
geom_point() +
geom_ribbon( aes(ymin = lwr, ymax = upr, fill = species, color = NULL), alpha = .1) +
geom_line(aes(y = fit), size = 1) +
theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The fitted lines in all the plots so far are different lengths. This
is because we have slightly different ranges of bill length data for
each species category in the dataset. By default when using
predict() we get the fitted values; i.e., the predicted
values from the dataset used in model fitting.
I think having different line lengths is fine here, but there are times when we want to draw each line across the entire range of the variable in the dataset. Also, sometimes our data are so sparse that our fitted line ends up not being very smooth; this is more likely to be a problem for non-linear model fits. In both of these situations we’d want to make a new dataset for making the predictions.
Let’s make model prediction lines using the entire range of bill
length instead of the within-species range. We can make a variable with
the full range of bill length via seq(), making a sequence
from the minimum to maximum dataset value. I use 0.1 as the increment in
seq(); the increment value you’ll want to use depends on
the range of your variable. Then to get the full range of bill length
associated with each species category we can use
expand.grid() in base R:
# Build a new bill_length_mm dataset that spans the full range of the original data at even intervals
newdata_bill_length_mm = seq(min(penguins_lm_3$bill_length_mm), max(penguins_lm_3$bill_length_mm), by = .1)
# Repeat complete bill_length_mm data for each species
newdata = expand.grid(bill_length_mm = newdata_bill_length_mm, species = unique(penguins_lm_3$species) ) # data.frame names must exactly match lm variables
head(newdata)
## bill_length_mm species
## 1 32.1 Adelie
## 2 32.2 Adelie
## 3 32.3 Adelie
## 4 32.4 Adelie
## 5 32.5 Adelie
## 6 32.6 Adelie
The key to making a dataset for prediction is that it must have every variable used in the model in it. You will get an error if you forget a variable or make a typo in one of the variable names. Note that the prediction dataset does not need to contain the response variable.
We use this prediction dataset with the newdata argument
in predict(). We can add the predicted values to the
prediction dataset using cbind(). When we make the plot of
the fitted lines now we can see that the line for each species covers
the same range. There are now two datasets used in the plotting code:
the original penguins_lm_3 for the data points and
newdata_predict_lm_3 for the predicted line and 95%
confidence intervals.
newdata_predict_lm_3 = cbind(newdata, predict(lm_3, interval="confidence", level=0.95, newdata = newdata))
dim(newdata_predict_lm_3)
## [1] 828 5
ggplot() +
geom_point(data=penguins_lm_3, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_ribbon(aes(ymin=lwr, ymax=upr, x = bill_length_mm, fill = species, color = NULL), alpha = .1, data=newdata_predict_lm_3) +
geom_line(aes(y = fit, x = bill_length_mm, color=species), size = 1, data=newdata_predict_lm_3) +
theme_bw()
Note that you could have drawn these model prediction lines for each
species with just 2 bill lengths at either end of the range of
predictions that you want (i.e. 30mm and 60mm), since 2 points is all it
takes to draw a line. However, see how the confidence intervals (from
geom_ribbon) are wider at the edges and narrower in the
middle? You would lose all of this detail in the confidence intervals,
and just draw a straight rectangle, if you generated predictions from
just 2 points.
Another way to generate model predictions is using the
augment() function in the broom package. This
function fits into the dplyr coding flow, so you call it with a pipe and
the model estimate columns automatically append to the data that you are
using. This avoids the extra call to cbind(), and perhaps a
mistake in joining data frames. Note that the augment()
function does not currently support confidence intervals other than the
95% confidence interval.
# Get model predictions
lm_3_predict = lm_3 %>%
broom::augment(data=penguins_lm_3, se_fit=TRUE, interval="confidence") # augment() instead of predict() # see ?augment.lm
#mutate(lwr = .fitted - 1.96 * .se.fit, upr = .fitted + 1.96 * .se.fit) # Calculate 95% C.I. using SE
# Plot the data and the model predictions
ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species), data=lm_3_predict) +
geom_point() +
geom_ribbon(aes(ymin = .lower, ymax = .upper, fill = species, color = NULL), alpha = .15) +
geom_line( aes(y = .fitted), size = 1)
Similarly, we can use augment() to generate predictions
with new data, so that our predicted model plots extend beyond the range
of the data used to fit the model. To stay within the tidyverse, we’ll
use tidyr::expand() in place of expand.grid()
to generate all possible combinations of the explanatory variables used
in our model.
# Get model predictions with new data
newdata = penguins_lm_3 %>%
tidyr::expand(bill_length_mm, species) # instead of expand.grid(), type ?tidyr::expand in R console
lm_3_predict = lm_3 %>%
broom::augment(newdata = newdata, se_fit=TRUE, interval="confidence") # instead of predict()
# Plot the data and the model predictions
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species), data=penguins_lm_3) +
geom_ribbon(aes(ymin = .lower, ymax = .upper, x = bill_length_mm, fill = species, color = NULL), alpha = .15, data=lm_3_predict) +
geom_line(data=lm_3_predict, aes(y = .fitted, x = bill_length_mm, color=species), size = 1)
Why am I showing you three completely different ways to generate the
same predictions and the same plot? Well, that’s the secret of
programming. There is always more than one way to get something done. By
trying out the different methods to accomplish the same task, you’ll
become a more versatile and efficient programmer. For your own linear
models, use the prediction method that makes the most sense for you.
However, one benefit of the tidyverse method is that if you wind up
running statistical analyses on Big Data and need to build many similar
models, these methods combined with tools in the tidyverse
purrr package make this very easy and efficient to
code.
I’m proud of that last plot so I’m going to make it a bit prettier and save it:
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species), data=penguins_lm_3) +
geom_ribbon(aes(ymin = .lower, ymax = .upper, x = bill_length_mm, fill = species, color = NULL), alpha = .15, data=lm_3_predict) +
geom_line(data=lm_3_predict, aes(y = .fitted, x = bill_length_mm, color=species), size = 1) +
theme_bw() +
xlab("Bill length (mm)") + ylab("Bill depth (mm)")
ggsave(filename = "figures/bill_depth_model.png", device = "png", width = 6, height = 3.5, units = "in", dpi = 300)
Build a multiple linear regression model where penguin
bill_depth_mm is a function of
flipper_length_mm + species. Use
ggPredict() to plot the data and the model predictions.
Recall our model lm_3:
lm_3 = lm(bill_depth_mm ~ bill_length_mm + species, data=penguins_lm_3)
We allow bill depth to depend on both the continuous variable bill
length and the categorical variable species, however, the species
independent variable only has the power to change the y-intercept of the
predictions for each of the three species. In this formulation, the
prediction lines associated with each species are parallel. But what if
this isn’t a good formulation for predicting bill depth? Perhaps, for
example, Chinstrap bill depth increases more slowly with increasing bill
length than Gentoo bill depth. To explore the possibility that the
relationship between bill depth and bill length changes for
each of the three species, we can add an interaction term. If we model
the interaction between bill length and species, the bill depth
prediction lines for each species are no longer forced to be parallel.
Variable interactions are indicated with the * sign in R
model formulas:
lm_4 = lm(bill_depth_mm ~ bill_length_mm*species, data=penguins_lm_3)
# lm_4 = lm(bill_depth_mm ~ bill_length_mm + species + bill_length_mm * species, data=penguins_lm_3) # Equivalent model formula
Note that bill_length_mm*species is short-hand for
bill_length_mm + species + bill_length_mm X species.
Interaction terms are coded this way because when you build a model with
an interaction, it is important to always include the interacting
variables on their own as well. You wouldn’t want a model that looked
like bill_depth_mm ~ bill_length_mm X species without
including bill length and species as additional terms because then the
interaction coefficients would be tasked with accounting for all of the
variation from each of those two independent variables in addition to
the interaction between those variables. This would make interpretation
difficult and may bias our understanding of the importance of the
interaction. If you want to model the interaction of two independent
variables, but suppress the inclusion of those variables on their own,
you can use a : in the model formula instead
(i.e. bill_depth_mm ~ bill_length_mm:species). Please think
critically and exercise caution if you choose to build this type of
model.
If we look at the model results for lm_3 side by side
with lm_4, we can see that the inclusion of the interaction
term is not helping our model very much. The \(R^2\) measure barely improves with
lm_4, and the \(Adjusted-R^2\) measure is actually lower.
This is because the \(Adjusted-R^2\)
measure is calculated from the \(R^2\)
measure, but it includes a penalty for model complexity (i.e. for each
additional model coefficient). None of the interaction coefficients in
lm_4 are statistically significant.
summary(lm_3)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + species, data = penguins_lm_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4529 -0.6864 -0.0508 0.5519 3.5915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.59218 0.68302 15.508 < 2e-16 ***
## bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
## speciesChinstrap -1.93319 0.22416 -8.624 2.55e-16 ***
## speciesGentoo -5.10602 0.19142 -26.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9533 on 338 degrees of freedom
## Multiple R-squared: 0.769, Adjusted R-squared: 0.7669
## F-statistic: 375.1 on 3 and 338 DF, p-value: < 2.2e-16
summary(lm_4)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm * species, data = penguins_lm_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6574 -0.6675 -0.0524 0.5383 3.5032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.40912 1.13812 10.025 < 2e-16 ***
## bill_length_mm 0.17883 0.02927 6.110 2.76e-09 ***
## speciesChinstrap -3.83998 2.05398 -1.870 0.062419 .
## speciesGentoo -6.15812 1.75451 -3.510 0.000509 ***
## bill_length_mm:speciesChinstrap 0.04338 0.04558 0.952 0.341895
## bill_length_mm:speciesGentoo 0.02601 0.04054 0.642 0.521590
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9548 on 336 degrees of freedom
## Multiple R-squared: 0.7697, Adjusted R-squared: 0.7662
## F-statistic: 224.5 on 5 and 336 DF, p-value: < 2.2e-16
We can use the AIC() function to compare the Akaike Information
Criterion between the 2 models. AIC can be used to compare 2 different
models (they do not have to be “nested” models) as long as the models
were estimated using the same dataset. The AIC for lm_4 is
higher, meaning the fitness is poorer compared to lm_3.
Another option for comparing a complex model with a simpler model is to
use the step() function. You can input the most complicated
model formulation that you are intersted in and this function will
reduce the model’s complexity one step at a time and check the fitness
at each step. Read the documentation on step() to learn
more about how decisions are made.
AIC(lm_3, lm_4)
## df AIC
## lm_3 5 943.8504
## lm_4 7 946.8778
best_model = step(lm_4)
## Start: AIC=-25.68
## bill_depth_mm ~ bill_length_mm * species
##
## Df Sum of Sq RSS AIC
## - bill_length_mm:species 2 0.87243 307.20 -28.704
## <none> 306.32 -25.676
##
## Step: AIC=-28.7
## bill_depth_mm ~ bill_length_mm + species
##
## Df Sum of Sq RSS AIC
## <none> 307.20 -28.70
## - bill_length_mm 1 118.67 425.87 81.01
## - species 2 949.16 1256.36 449.00
summary(best_model)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + species, data = penguins_lm_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4529 -0.6864 -0.0508 0.5519 3.5915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.59218 0.68302 15.508 < 2e-16 ***
## bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
## speciesChinstrap -1.93319 0.22416 -8.624 2.55e-16 ***
## speciesGentoo -5.10602 0.19142 -26.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9533 on 338 degrees of freedom
## Multiple R-squared: 0.769, Adjusted R-squared: 0.7669
## F-statistic: 375.1 on 3 and 338 DF, p-value: < 2.2e-16
If we plotted the lm_4 predictions for each of the 3
species, the lines will probably look pretty close to parallel, since
the interaction terms were not significant. We can build our predictions
with the same methods we used for lm_3.
# Plotting predictions with ggPredict():
# ggPredict(lm_4, se=TRUE, interactive=TRUE) # easy, but less customizable
# Get model predictions
lm_4_predict = lm_4 %>%
augment(se_fit=TRUE, interval="confidence")
# Plot the data and the model predictions
ggplot(data=lm_4_predict) +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_ribbon(aes(ymin = .lower, ymax = .upper, x = bill_length_mm, fill = species, color = NULL), alpha = .15) +
geom_line(aes(y = .fitted, x = bill_length_mm, color=species), size = 1) +
theme_bw()
As we expected, the distinct slopes of bill depth vs. bill length for
each of the 3 species are almost parallel. Since the AIC values are
lower (i.e. better fit) for lm_3, we would select that as our final
model instead of lm_4. If you look carefully at the lm_4 coefficient
table (or if you hover over the species-specific model lines in the
interactive ggiraph model prediction plot), you’ll notice that for a
given species, the model coefficients are exactly equal to the
coefficients of a linear model for
bill_depth_mm ~ bill_length_mm using just the data for a
single species. For example, if you hover over the blue Gentoo model
prediction line, you’ll see the model formula is
y = 0.2*x + 5.25. Recall that the model lm_2 that we built
with the gentoo data had the same formula:
y = 0.2*x + 5.25.
When including multiple independent variables in a regression,
especially multiple continuous variables, you need to check the
collinearity of your predictors. If 2 predictors x1 and
x2 are extremely collinear, then you can bias your
interpretation of the model because you can’t objectively determine how
much variation in y is caused by x1
vs. x2. To deal with this, you should calculate the
Variance Inflation Factor of your independent variables. If the VIF = 1,
then your independent variables are not correlated. There are different
thresholds set for VIF, but generally, if the VIF > 4 or 5, you have
a moderate problem with multicollinearity and if the VIF > 10, you
have a big problem with multicollinearity in your model. This statistic
should be reported and model results should be interpreted with the VIF
in mind. If you are more interested in a mechanistic model
(i.e. figuring out WHAT drives y) rather than a predictive
model (i.e. getting the BEST estimate of y) then you may
need to use your best human judgement to remove predictor variables
until the model results are more interpretable.
Let’s try out a regression with 2 continuous variables. We’ll look at
Gentoo penguin bill depth as a function of bill length, flipper length
and body mass. We’ll use AIC() and step() to
compare this complex model with the simpler models. We’ll also use
car::vif() to check if we have a problem with
multicollinearity.
library(car) # vif()
library(ggiraph)
library(ggiraphExtra) # ggPredict()
gentoo = penguins %>%
filter(species=="Gentoo")
# Build simple linear regression
lm_gentoo_1 = lm(bill_depth_mm ~ bill_length_mm, data=gentoo)
# Build multiregression with 2 variables
lm_gentoo_2 = lm(bill_depth_mm ~ bill_length_mm + flipper_length_mm, data=gentoo)
# Build multiregression with 3 variables
lm_gentoo_3 = lm(bill_depth_mm ~ bill_length_mm + flipper_length_mm + body_mass_g, data=gentoo)
AIC(lm_gentoo_1, lm_gentoo_2, lm_gentoo_3) # lm_gentoo_3 performs best
## df AIC
## lm_gentoo_1 3 283.6673
## lm_gentoo_2 4 251.9633
## lm_gentoo_3 5 236.7460
best_model = step(lm_gentoo_3) # Doesn't remove any variables
## Start: AIC=-114.31
## bill_depth_mm ~ bill_length_mm + flipper_length_mm + body_mass_g
##
## Df Sum of Sq RSS AIC
## <none> 45.503 -114.313
## - bill_length_mm 1 1.8206 47.323 -111.487
## - flipper_length_mm 1 5.6252 51.128 -101.976
## - body_mass_g 1 6.8367 52.339 -99.096
summary(best_model)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + flipper_length_mm +
## body_mass_g, data = gentoo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.45525 -0.41800 0.03198 0.29458 2.28070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.2009521 2.1567620 -1.020 0.309567
## bill_length_mm 0.0572019 0.0262149 2.182 0.031072 *
## flipper_length_mm 0.0499049 0.0130113 3.836 0.000202 ***
## body_mass_g 0.0007145 0.0001690 4.228 4.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6184 on 119 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6126, Adjusted R-squared: 0.6028
## F-statistic: 62.73 on 3 and 119 DF, p-value: < 2.2e-16
vif(lm_gentoo_3) # vif values ~ 2, mild multicollinearity
## bill_length_mm flipper_length_mm body_mass_g
## 2.082527 2.271573 2.315374
The Variance Inflation Factors were all around 2, indicating that we
don’t have any serious multicollinearity. When we tested our complex
model with 3 continuous variables in the step() function,
the full model was retained. Comparisons between the AIC values for each
of the 3 increasingly complex models also indicates that the most
complex model has the lowest AIC by a significant margin (i.e. >2 AIC
units). That means we should keep the complex model
lm_gentoo_3.
Now let’s plot the model predictions. This is where things get a bit
tricky, because we are interested in changes along four dimensions (bill
depth, bill length, flipper length and body mass) but we can only easily
visualize 2 dimensions in our plots. The simplest solution here is to
plot changes in our y variable against changes in our x variables one at
a time, while holding the other x variables constant. A good way to do
this is to set 2 of your x variables to their median value in the
observations, and then look at how y changes with the remaining x
variable. We can do this by generating a new data set, and then using
the augment() function in the tidyverse (or
predict() in base R) and plotting just like we did
earlier:
### Look at bill depth ~ body mass while holding bill length and flipper length constant
# Use expand to get full range of 1 variable, then add in median of other variable(s) from original data
newdata = gentoo %>%
select(body_mass_g) %>% # or distinct(body_mass_g) to remove repeats
mutate(bill_length_mm = median(gentoo$bill_length_mm, na.rm=TRUE),
flipper_length_mm = median(gentoo$flipper_length_mm, na.rm=TRUE))
lm_gentoo_3_predict = lm_gentoo_3 %>%
broom::augment(newdata = newdata, se_fit=TRUE, interval="confidence") # ?augment.lm
# Plot the data and the model predictions
ggplot(data=lm_gentoo_3_predict) +
geom_point(aes(x = body_mass_g, y = bill_depth_mm), data=gentoo) + # original data
geom_ribbon(aes(ymin = .lower, ymax = .upper, x = body_mass_g), alpha = .15) +
geom_line(aes(y = .fitted, x = body_mass_g)) +
annotate("text", x=4250, y=17, label= paste0("flipper length = ", median(gentoo$flipper_length_mm, na.rm=TRUE), "mm")) +
annotate("text", x=4250, y=16.5, label= paste0("bill length = ", median(gentoo$bill_length_mm, na.rm=TRUE), "mm"))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
Plot the model predictions from our model lm_gentoo_3 so
that we can see the variation in bill depth vs. flipper length while
holding bill length and body mass constant at their medians.
Analysis of Variance (ANOVA) is just a special case of linear
regression where the independent predictor variables are all categorical
variables. Mathematically, running an ANOVA is the same as running a
linear model. If you plan to include continuous (numerical) predictor
variables, this is called ANCOVA (Analysis of Covariance). To run an
ANOVA / ANCOVA, you can build your linear model using lm()
as before, and then print the output of the model using the function
anova(), which presents the statistical output in a classic
ANOVA table. I always like to emphasize how there are multiple ways to
accomplish the same goals in programming! Another way to run an ANOVA is
to use the aov() function to run an ANOVA model
directly.
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
# Conduct an ANOVA using lm()
penguin_lm = lm(body_mass_g ~ species + sex, data=penguins)
summary(penguin_lm)
##
## Call:
## lm(formula = body_mass_g ~ species + sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -816.87 -217.80 -16.87 227.61 882.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3372.39 31.43 107.308 <2e-16 ***
## speciesChinstrap 26.92 46.48 0.579 0.563
## speciesGentoo 1377.86 39.10 35.236 <2e-16 ***
## sexmale 667.56 34.70 19.236 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 316.6 on 329 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.8468, Adjusted R-squared: 0.8454
## F-statistic: 606.1 on 3 and 329 DF, p-value: < 2.2e-16
anova(penguin_lm)
## Analysis of Variance Table
##
## Response: body_mass_g
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 145190219 72595110 724.21 < 2.2e-16 ***
## sex 1 37090262 37090262 370.01 < 2.2e-16 ***
## Residuals 329 32979185 100241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Conduct the same ANOVA using aov()
penguin_anova = aov(body_mass_g ~ species + sex, data=penguins)
summary(penguin_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 145190219 72595110 724.2 <2e-16 ***
## sex 1 37090262 37090262 370.0 <2e-16 ***
## Residuals 329 32979185 100241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 11 observations deleted due to missingness
Note that the output of the anova(penguin_lm) has all of
the same statistics and results as summary(penguin_anova).
The output tells us that both species and sex are significant predictors
of penguin body mass, but it doesn’t tell us more than that. To find out
which groups are associated with heavier vs. lighter body mass, simply
calculate and compare the mean of the the two groups. With groups with 3
or more categories, we need to conduct a post hoc Tukey
test.
# which sex has higher body mass?
penguins %>%
group_by(sex) %>%
summarize(mean_body_mass_g = mean(body_mass_g))
## # A tibble: 3 × 2
## sex mean_body_mass_g
## <fct> <dbl>
## 1 female 3862.
## 2 male 4546.
## 3 <NA> NA
# which species has higher body mass?
penguins %>%
group_by(species) %>%
summarize(mean_body_mass_g = mean(body_mass_g, na.rm=TRUE))
## # A tibble: 3 × 2
## species mean_body_mass_g
## <fct> <dbl>
## 1 Adelie 3701.
## 2 Chinstrap 3733.
## 3 Gentoo 5076.
TukeyHSD(penguin_anova) # Requires the output of the aov() function
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = body_mass_g ~ species + sex, data = penguins)
##
## $species
## diff lwr upr p adj
## Chinstrap-Adelie 26.92385 -82.51532 136.363 0.8313289
## Gentoo-Adelie 1386.27259 1294.21284 1478.332 0.0000000
## Gentoo-Chinstrap 1359.34874 1246.03315 1472.664 0.0000000
##
## $sex
## diff lwr upr p adj
## male-female 667.4577 599.193 735.7224 0
So the post-hoc Tukey test tells us that Chinstrap and Adelie penguin
body mass is not significantly different (p>0.05), however, both
Chinstrap and Adelie penguins body mass are each significantly different
from Gentoo penguins (p<0.05). The diff column in the
TukeyHSD() output shows the difference between the mean
body mass for the first species minus the second species. We can also
simply calculate the body masses for each species separately to find and
compare the mean body mass.
ANOVA is pretty robust to not-quite-normal data, but if you have a
big normality problem you can switch to the Kruskall Wallace test
kruskal.test().
Conduct an Analysis of Variance to determine whether Adelie body mass is significantly different between the three islands where observations were collected. Conduct a post-hoc Tukey test if appropriate.
Overview of basic regression capabilities in R: https://www.statmethods.net/stats/regression.html
Diagnostic tests for linear regression: https://www.statmethods.net/stats/rdiagnostics.html https://ianruginski.netlify.app/post/regressionassumptions/
ggpredict() vignettes: https://cran.r-project.org/web/packages/ggiraphExtra/vignettes/ggPredict.html